Abstract



In machine learning, Domain Adaptation (DA) arises when the distribution generating the test (target) data differs from the one generating the learning (source) data. It is well known that DA is a hard task even under strong assumptions [1], among them the covariate-shift assumption, where the source and target distributions differ only in their marginals, i.e. they share the same labeling function. Another popular approach is to consider a hypothesis class that brings the two distributions closer while implying a low error for both tasks [2]. This is a VC-dimension approach that restricts the complexity of a hypothesis class in order to get good generalization. Instead, we propose a PAC-Bayesian approach that seeks suitable weights to be given to each hypothesis in order to build a majority vote. We prove a new DA bound in the PAC-Bayesian context. This leads us to design the first DA-PAC-Bayesian algorithm, based on the minimization of the proposed bound. Doing so, we seek a $\rho$-weighted majority vote that takes into account a trade-off between three quantities. The first two are, as usual in the PAC-Bayesian approach, (a) the complexity of the majority vote (measured by a Kullback-Leibler divergence) and (b) its empirical risk (measured by the $\rho$-average errors on the source sample). The third is (c) the capacity of the majority vote to distinguish some structural difference between the source and target samples.



Preliminaries 

Domain Adaptation. We consider DA for binary classification tasks where $X \subseteq \mathbb{R}^d$ is the input space of dimension $d$ and $Y = \{-1, 1\}$ is the label set. We have two different distributions over $X \times Y$, called the source domain $P_S$ and the target domain $P_T$; $D_S$ and $D_T$ are the respective marginal distributions over $X$. We tackle the challenging task where we have no information about the labels on $P_T$. A learning algorithm is then provided with a labeled source sample $S = \{(x_i^s, y_i^s)\}_{i=1}^{m}$ drawn i.i.d. from $P_S$, and an unlabeled target sample $T = \{x_j^t\}_{j=1}^{m}$ drawn i.i.d. from $D_T$. Let $h : X \to Y$ be a hypothesis function. The expected source error of $h$ over $P_S$ is the probability that $h$ commits an error, $R_{P_S}(h) = \mathbb{E}_{(x^s, y^s) \sim P_S}\, I(h(x^s) \neq y^s)$, where $I(a) = 1$ if predicate $a$ is true and $0$ otherwise. The expected target error $R_{P_T}$ over $P_T$ is defined in a similar way. $R_S$ is the empirical source error.

The DA objective is then to find a low-error target hypothesis, even if no label information is available about the target domain. Clearly, this task can be infeasible in general. However, under the assumption that there exist hypotheses in the hypothesis class $\mathcal{H}$ that perform well on both the source and the target domain, Ben-David et al. [2] provide the following guarantee,

$$\forall h \in \mathcal{H}, \quad R_{P_T}(h) \le R_{P_S}(h) + \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T) + \nu, \tag{1}$$

where $\nu = \min_{h \in \mathcal{H}} \left( R_{P_S}(h) + R_{P_T}(h) \right)$ is the error of the best joint hypothesis, and $d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T)$, called the $\mathcal{H}\Delta\mathcal{H}$-distance between the domain marginal distributions, quantifies how well hypotheses from $\mathcal{H}$ can "detect" differences between those two distributions. According to Equation (1), the lower this detection capability is for a given $\mathcal{H}$, the better the generalization guarantees. Hence, as pointed out in [2], Equation (1), together with the usual VC-bound theory, expresses a multiple trade-off between the accuracy of a particular hypothesis $h$, the complexity of the hypothesis class $\mathcal{H}$, and the "incapacity" of the hypotheses of $\mathcal{H}$ to detect differences between the source and the target domain.



PAC-Bayesian Learning of Linear Classifier. The PAC-Bayesian theory, first introduced by McAllester [3], traditionally considers majority votes over a set $\mathcal{H}$ of binary hypotheses. Given a prior distribution $\pi$ over $\mathcal{H}$ and a training set $S$, the learning process consists in finding the posterior distribution $\rho$ over $\mathcal{H}$ leading to good generalization. Indeed, the essence of this theory is to bound the risk of the stochastic Gibbs classifier $G_\rho$ associated with $\rho$. In order to predict the label of an example $x$, the Gibbs classifier first draws a hypothesis $h$ from $\mathcal{H}$ according to $\rho$, then returns $h(x)$ as the predicted label. Note that the error of the Gibbs classifier corresponds to the expectation of the errors over $\rho$: $R_{P_S}(G_\rho) = \mathbb{E}_{h \sim \rho} R_{P_S}(h)$. The classical PAC-Bayesian theorem bounds the expected error $R_{P_S}(G_\rho)$ in terms of two major quantities: the empirical error $R_S(G_\rho) = \mathbb{E}_{h \sim \rho} R_S(h)$ on a sample $S$ and the Kullback-Leibler divergence $\mathrm{KL}(\rho \,\|\, \pi) = \mathbb{E}_{h \sim \rho} \ln \frac{\rho(h)}{\pi(h)}$.

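Theorem 1 (as presented in [4]). For any domain $P_S \subseteq X \times Y$, for any set $\mathcal{H}$ of hypotheses, for any prior distribution $\pi$ over $\mathcal{H}$, and any $\delta \in (0, 1]$, we have,

$$\Pr_{S \sim (P_S)^m}\left( \forall \rho \text{ on } \mathcal{H} :\ \mathrm{kl}\big(R_S(G_\rho) \,\|\, R_{P_S}(G_\rho)\big) \le \frac{1}{m}\left[ \mathrm{KL}(\rho \,\|\, \pi) + \ln \frac{\xi(m)}{\delta} \right] \right) \ge 1 - \delta,$$

where $\mathrm{kl}(q \,\|\, p) \overset{\mathrm{def}}{=} q \ln \frac{q}{p} + (1 - q) \ln \frac{1 - q}{1 - p}$, and $\xi(m) \overset{\mathrm{def}}{=} \sum_{i=0}^{m} \binom{m}{i} \left(\frac{i}{m}\right)^{i} \left(1 - \frac{i}{m}\right)^{m - i}$.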
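To make the quantities of Theorem 1 concrete, the following is a minimal sketch (not the authors' code) that evaluates $\mathrm{kl}(q \,\|\, p)$ and $\xi(m)$ and then inverts the bound numerically. The function names, the clipping constant and the bisection tolerance are illustrative assumptions; the inversion relies on $\mathrm{kl}(q \,\|\, p)$ being increasing in $p$ for $p \ge q$.

```python
import math

def kl_bernoulli(q, p):
    """kl(q || p) between two Bernoulli parameters, with the convention 0 ln 0 = 0."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # assumption: clip p so the logarithms stay finite
    out = 0.0
    if q > 0.0:
        out += q * math.log(q / p)
    if q < 1.0:
        out += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return out

def xi(m):
    """xi(m) = sum_{i=0}^m C(m, i) (i/m)^i (1 - i/m)^(m - i), computed in log-space."""
    total = 0.0
    for i in range(m + 1):
        log_term = math.lgamma(m + 1) - math.lgamma(i + 1) - math.lgamma(m - i + 1)
        if i > 0:
            log_term += i * math.log(i / m)
        if i < m:
            log_term += (m - i) * math.log(1.0 - i / m)
        total += math.exp(log_term)
    return total

def kl_upper_bound(emp_risk, kl_div, m, delta, tol=1e-9):
    """Largest r >= emp_risk with kl(emp_risk || r) <= (kl_div + ln(xi(m)/delta)) / m."""
    rhs = (kl_div + math.log(xi(m) / delta)) / m
    lo, hi = emp_risk, 1.0
    while hi - lo > tol:  # bisection on the increasing branch of kl
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(emp_risk, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo
```

For instance, `kl_upper_bound(0.1, 2.0, 1000, 0.05)` returns the risk bound implied by an empirical Gibbs risk of 0.1, a KL term of 2 and 1000 examples at confidence level 0.95.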



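Now, let $\mathcal{H}$ be a set of linear classifiers $h_v(x) = \mathrm{sgn}(v \cdot x)$ such that $v \in \mathbb{R}^d$ is a weight vector. By restricting the prior and the posterior to be Gaussian distributions, Langford and Shawe-Taylor [5] have specialized the PAC-Bayesian theory in order to bound the expected risk of any linear classifier $h_w \in \mathcal{H}$ identified by a weight vector $w$. More precisely, for a prior $\pi_0$ and a posterior $\rho_w$ defined as spherical Gaussians with identity covariance matrix, respectively centered on vectors $0$ and $w$, i.e.

$$\text{for any } h_v \in \mathcal{H}, \quad \pi_0(h_v) = \left(\tfrac{1}{\sqrt{2\pi}}\right)^{d} e^{-\frac{1}{2}\|v\|^2} \quad \text{and} \quad \rho_w(h_v) = \left(\tfrac{1}{\sqrt{2\pi}}\right)^{d} e^{-\frac{1}{2}\|v - w\|^2}, \tag{2}$$

we obtain that the expected risk of the Gibbs classifier $G_{\rho_w}$ on a domain $P_S$ is given by,

$$R_{P_S}(G_{\rho_w}) = \mathbb{E}_{(x, y) \sim P_S}\ \mathbb{E}_{h_v \sim \rho_w} I(h_v(x) \neq y) = \mathbb{E}_{(x, y) \sim P_S}\ \Phi\left( y\, \frac{w \cdot x}{\|x\|} \right),$$

where $\Phi(a) = \frac{1}{2}\left[ 1 - \mathrm{Erf}\left( \frac{a}{\sqrt{2}} \right) \right]$. Moreover, the KL-divergence between the posterior and the prior distributions becomes simply $\mathrm{KL}(\rho_w \,\|\, \pi_0) = \frac{1}{2}\|w\|^2$. In this context, Theorem 1 becomes,

Corollary 1. For any domain $P_S \subseteq \mathbb{R}^d \times Y$ and any $\delta \in (0, 1]$, we have,

$$\Pr_{S \sim (P_S)^m}\left( \forall w \in \mathbb{R}^d :\ \mathrm{kl}\big(R_S(G_{\rho_w}) \,\|\, R_{P_S}(G_{\rho_w})\big) \le \frac{1}{m}\left[ \tfrac{1}{2}\|w\|^2 + \ln \frac{\xi(m)}{\delta} \right] \right) \ge 1 - \delta.$$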
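The closed form of the Gibbs risk for a Gaussian posterior can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the inputs are NumPy arrays with non-zero rows in `X`, and the Monte Carlo routine simply samples weight vectors $v \sim \mathcal{N}(w, I)$ to compare against the $\Phi$-based formula.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def phi(a):
    """Phi(a) = 0.5 * (1 - erf(a / sqrt(2))), applied elementwise."""
    return 0.5 * (1.0 - _erf(np.asarray(a, dtype=float) / math.sqrt(2.0)))

def gibbs_risk(w, X, y):
    """Empirical Gibbs risk: mean of Phi(y * w.x / ||x||) over the sample."""
    margins = y * (X @ w) / np.linalg.norm(X, axis=1)
    return float(np.mean(phi(margins)))

def gibbs_risk_monte_carlo(w, X, y, n_draws, rng):
    """Same quantity estimated by sampling linear classifiers v ~ N(w, I)."""
    V = rng.normal(loc=w, scale=1.0, size=(n_draws, w.shape[0]))
    preds = np.sign(X @ V.T)  # shape (n_examples, n_draws)
    return float(np.mean(preds != y[:, None]))

def kl_gaussian(w):
    """KL between N(w, I) and N(0, I), i.e. 0.5 * ||w||^2."""
    return 0.5 * float(np.dot(w, w))
```

With a few thousand draws the two estimates of the Gibbs risk should agree to roughly two decimal places, which is a quick sanity check before plugging `gibbs_risk` and `kl_gaussian` into the bound of Corollary 1.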



Based on this specialization of the PAC-Bayesian theory to linear classifiers, Germain et al. [4] suggested minimizing the bound on $R_{P_S}(G_{\rho_w})$ given by Corollary 1. The resulting learning algorithm, called PBGD, performs a gradient descent in order to find an optimal weight vector $w$. Doing so, PBGD realizes a trade-off between the empirical accuracy (expressed by $R_S(G_{\rho_w})$) and the complexity (expressed by $\|w\|^2$) of the learned linear classifier.



PAC-Bayesian Learning of Adapted Linear Classifier 

DA Bound for the Gibbs Classifier. The originality of our contribution is to combine the PAC-Bayesian and DA frameworks. We define the notion of domain disagreement $\mathrm{dis}_\rho(D_S, D_T)$ to measure the structural difference between domain marginals in terms of the posterior distribution $\rho$,

$$\mathrm{dis}_\rho(D_S, D_T) = \mathbb{E}_{h_1, h_2 \sim \rho^2} \left[ R_{D_T}(h_1, h_2) - R_{D_S}(h_1, h_2) \right],$$

where $R_{D'}(h_1, h_2) = \mathbb{E}_{x \sim D'}\, I(h_1(x) \neq h_2(x))$. Unlike the distance $d_{\mathcal{H}\Delta\mathcal{H}}$ suggested by [2], our "distance" measure $\mathrm{dis}_\rho$ takes into account a $\rho$-average over all pairs of hypotheses in $\mathcal{H}$ instead of focusing on a single particular pair of hypotheses. It nevertheless allows us to derive the following bound, which proposes a similar trade-off as in Equation (1), but relates the source and target errors of the Gibbs classifier. For every probability distribution $\rho$ on $\mathcal{H}$, we have,

$$R_{P_T}(G_\rho) \le R_{P_S}(G_\rho) + \mathrm{dis}_\rho(D_S, D_T) + \lambda_\rho, \tag{3}$$

where $\lambda_\rho = R_{P_T}(h^*) + R_{P_S}(h^*)$, with $h^* = \operatorname{argmin}_{h \in \mathcal{H}} \left\{ \mathbb{E}_{h' \sim \rho} \left( R_{D_T}(h, h') - R_{D_S}(h, h') \right) \right\}$, measures the joint error of the hypothesis which minimizes the domain disagreement. Hence, similarly to Equation (1), we provide evidence that a good DA is possible if $\mathrm{dis}_\rho(D_S, D_T)$ and $\lambda_\rho$ are low. Under this assumption, we propose to design the first DA-PAC-Bayesian algorithm, inspired by the PAC-Bayesian learning of linear classifiers [4]. We focus on the first two terms of Inequality (3), and we refer to this quantity as the expected adaptation loss,

$$B_{P_{(S,T)}}(G_\rho) = R_{P_S}(G_\rho) + \mathrm{dis}_\rho(D_S, D_T),$$

where $P_{(S,T)}$ denotes the joint distribution over $P_S \times D_T$. The independence of each draw from $P_S$ and $D_T$ allows us to rewrite $B_{P_{(S,T)}}$ as the expectation of the domain adaptation loss $\mathcal{L}_{DA}$,

$$B_{P_{(S,T)}}(G_\rho) = \mathbb{E}_{h_1, h_2 \sim \rho^2}\ \mathbb{E}_{(x^s, y^s, x^t) \sim P_{(S,T)}}\ \mathcal{L}_{DA}(h_1, h_2, x^s, y^s, x^t), \tag{4}$$

with

$$\mathcal{L}_{DA}(h_1, h_2, x^s, y^s, x^t) = I(h_1(x^s) \neq y^s) + I(h_1(x^t) \neq h_2(x^t)) - I(h_1(x^s) \neq h_2(x^s)).$$

Given $(S, T) = \{(x_i^s, y_i^s, x_i^t)\}_{i=1}^{m}$, a sample of $m$ source-target pairs drawn i.i.d. from $P_{(S,T)}$, the empirical adaptation loss of $G_\rho$ is $B_{(S,T)}(G_\rho) = \mathbb{E}_{h_1, h_2 \sim \rho^2} \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}_{DA}(h_1, h_2, x_i^s, y_i^s, x_i^t)$.
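As an illustration of the empirical adaptation loss, the sketch below evaluates it for a finite set of hypotheses with explicit posterior weights. This finite setting, the function name and the array layout (one row of $\pm 1$ predictions per hypothesis) are assumptions made for the example; the text itself works with continuous Gaussian posteriors.

```python
import numpy as np

def empirical_adaptation_loss(rho, preds_s, y_s, preds_t):
    """Empirical adaptation loss of the Gibbs classifier for a finite hypothesis set.

    rho:     posterior weights, shape (n_hyp,), summing to 1
    preds_s: predictions on the source points, shape (n_hyp, m)
    y_s:     source labels, shape (m,)
    preds_t: predictions on the target points, shape (n_hyp, m)
    """
    rho = np.asarray(rho, dtype=float)

    # First term of the loss: rho-average of the empirical source errors.
    source_errors = (preds_s != y_s[None, :]).mean(axis=1)
    gibbs_risk = rho @ source_errors

    def disagreement(preds):
        # d[a, b] = fraction of points on which hypotheses a and b disagree.
        d = (preds[:, None, :] != preds[None, :, :]).mean(axis=2)
        return rho @ d @ rho  # average over pairs (h1, h2) drawn from rho^2

    # Target disagreement minus source disagreement estimates dis_rho.
    return float(gibbs_risk + disagreement(preds_t) - disagreement(preds_s))
```

With two hypotheses weighted equally, `preds_s = [[1, 1], [1, -1]]`, `y_s = [1, -1]` and `preds_t = [[1, 1], [-1, -1]]`, the three terms are 0.25, 0.5 and 0.25, so the loss equals 0.5.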

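PAC-Bayesian Bounds for Domain Adaptation. We restrict ourselves to the case exhibited by Equation (2), where $\mathcal{H}$ is a set of linear classifiers, and the posterior and prior distributions are Gaussians. First, we compute the expected adaptation loss $B_{P_{(S,T)}}(G_{\rho_w})$ of the Gibbs classifier $G_{\rho_w}$ (remember that the posterior distribution is centered on the linear classifier $h_w$). With $\Phi_{dis}(a) = 2\,\Phi(a)\,\Phi(-a)$, we obtain,

$$B_{P_{(S,T)}}(G_{\rho_w}) = \mathbb{E}_{(x^s, y^s, x^t) \sim P_{(S,T)}} \left[ \Phi\left( y^s \frac{w \cdot x^s}{\|x^s\|} \right) + \Phi_{dis}\left( \frac{w \cdot x^t}{\|x^t\|} \right) - \Phi_{dis}\left( \frac{w \cdot x^s}{\|x^s\|} \right) \right].$$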
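The empirical counterpart of this loss can be sketched as follows for a Gaussian posterior centered on $w$. This is a hedged illustration, not the authors' implementation: it assumes the loss decomposes as the source Gibbs risk plus the target disagreement minus the source disagreement, with $\Phi_{dis}(a) = 2\,\Phi(a)\,\Phi(-a)$, and the function names and NumPy array inputs (non-zero rows) are hypothetical.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def phi(a):
    """Phi(a) = 0.5 * (1 - erf(a / sqrt(2))), applied elementwise."""
    return 0.5 * (1.0 - _erf(np.asarray(a, dtype=float) / math.sqrt(2.0)))

def phi_dis(a):
    """Probability that two classifiers drawn from the posterior disagree."""
    return 2.0 * phi(a) * phi(-np.asarray(a, dtype=float))

def adaptation_loss_gaussian(w, X_s, y_s, X_t):
    """Empirical adaptation loss of the Gibbs classifier for posterior N(w, I)."""
    a_s = (X_s @ w) / np.linalg.norm(X_s, axis=1)  # normalized source margins
    a_t = (X_t @ w) / np.linalg.norm(X_t, axis=1)  # normalized target margins
    risk = np.mean(phi(y_s * a_s))   # source Gibbs risk
    dis_t = np.mean(phi_dis(a_t))    # expected disagreement on target points
    dis_s = np.mean(phi_dis(a_s))    # expected disagreement on source points
    return float(risk + dis_t - dis_s)
```

Two simple checks follow from the formula: when the target points coincide with the source points the disagreement terms cancel and only the Gibbs risk remains, and for $w = 0$ every term equals 0.5.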



Now, we derive a new PAC-Bayesian theorem to bound the expected adaptation loss of linear classifiers. Theorem 2 is obtained from two key results. First, we use the specialization of the PAC-Bayesian theory to linear classifiers introduced by Corollary 1. Second, we need the methodology developed by [6, Theorem 5] to bound a loss relying on a pair of hypotheses $h_1, h_2 \sim \rho^2$ (like our domain adaptation loss of Equation (4)). We then obtain $\mathrm{KL}(\rho_w^2 \,\|\, \pi_0^2) = 2\, \mathrm{KL}(\rho_w \,\|\, \pi_0) = \|w\|^2$.
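Theorem 2. For any domain $P_{(S,T)} \subseteq \mathbb{R}^d \times Y \times \mathbb{R}^d$ and any $\delta \in (0, 1]$, we have,

$$\Pr_{(S,T) \sim (P_{(S,T)})^m}\left( \forall w \in \mathbb{R}^d :\ \mathrm{kl}\big(B^*_{(S,T)} \,\|\, B^*_{P_{(S,T)}}\big) \le \frac{1}{m}\left[ \|w\|^2 + \ln \frac{\xi(m)}{\delta} \right] \right) \ge 1 - \delta,$$

where $B^*_{(S,T)} \overset{\mathrm{def}}{=} \frac{1}{2} B_{(S,T)}(G_{\rho_w}) + \frac{1}{2}$ and $B^*_{P_{(S,T)}} \overset{\mathrm{def}}{=} \frac{1}{2} B_{P_{(S,T)}}(G_{\rho_w}) + \frac{1}{2}$ ensure that the values provided to the $\mathrm{kl}(\cdot \,\|\, \cdot)$ function are in interval $[0, 1]$.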



Designing the Algorithm. The algorithm DA-PBGD, described here, minimizes the upper bound given by Theorem 2 by gradient descent. The corresponding objective function is,

$$B((S,T), w, \delta) \overset{\mathrm{def}}{=} \sup\left\{ \epsilon :\ \mathrm{kl}\big(B^*_{(S,T)} \,\|\, \epsilon\big) \le \frac{1}{m}\left[ \|w\|^2 + \ln \frac{\xi(m)}{\delta} \right] \right\},$$

for a fixed value of $\delta$. Consequently, our problem is to find the weight vector $w^*$ that minimizes $B$ subject to the constraints $B > B^*_{(S,T)}$ and $\mathrm{kl}\big(B^*_{(S,T)} \,\|\, B\big) = \frac{1}{m}\left[ \|w\|^2 + \ln \frac{\xi(m)}{\delta} \right]$. The gradient is obtained by computing the partial derivative of both sides of the latter equation with respect to $w_j$ (the $j$-th component of $w$). After solving for $\partial B / \partial w_j$, we find that the gradient is,



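$$\nabla B = \frac{B(1 - B)}{2m\,(B - B^*_{(S,T)})} \left( 4w + \ln\left[ \frac{B\,(1 - B^*_{(S,T)})}{B^*_{(S,T)}\,(1 - B)} \right] \sum_{i=1}^{m} \left[ \Phi'\!\left( y_i^s \frac{w \cdot x_i^s}{\|x_i^s\|} \right) \frac{y_i^s\, x_i^s}{\|x_i^s\|} + \Phi'_{dis}\!\left( \frac{w \cdot x_i^t}{\|x_i^t\|} \right) \frac{x_i^t}{\|x_i^t\|} - \Phi'_{dis}\!\left( \frac{w \cdot x_i^s}{\|x_i^s\|} \right) \frac{x_i^s}{\|x_i^s\|} \right] \right),$$

where $\Phi'(a)$ and $\Phi'_{dis}(a)$ denote respectively the derivatives of $\Phi$ and $\Phi_{dis}$ evaluated at $a$. The kernel trick applied to DA-PBGD allows us to work with a dual weight vector $\alpha$ that is a linear classifier in an augmented space. Given a kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, we have $h_w(x) = \mathrm{sgn}\big( \sum_{i} \alpha_i\, k(x_i, x) \big)$, where the sum runs over the training examples $x_i$.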
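The dual form of the classifier can be sketched as below with a Gaussian kernel. This is a minimal illustration under assumptions: the function names, the bandwidth parameter `gamma` and the convention of predicting $+1$ on a zero score are not taken from the text, and the dual weights `alpha` are supposed to have been learned elsewhere.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows of A and B."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def predict(alpha, X_train, X_new, gamma=1.0):
    """Sign of sum_i alpha_i * k(x_i, x) for each row x of X_new."""
    scores = gaussian_kernel(X_new, X_train, gamma) @ alpha
    return np.where(scores >= 0.0, 1, -1)
```

For example, with training points `[[0, 0], [2, 0]]` and `alpha = [1, -1]`, a query near the first point is labeled $+1$ and a query near the second is labeled $-1$.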

Figure 1: Illustration of the decision of DA-PBGD for 4 rotation angles, from left to right: 20°, 30°, 40°, 50°. The source sample is shown in green and pink, the target sample in grey.



Experimental Results. Our DA-PBGD has been evaluated on a toy problem called inter-twinning moons and compared with: PBGD and SVM with no adaptation, the semi-supervised Transductive-SVM (TSVM) [7], the iterative DA algorithm DASVM [8] and the non-iterative version of DASF [9] based on the bound (1). We used a Gaussian kernel for all the methods. These preliminary results, reported in Tab. 1 and illustrated in Fig. 1, are promising: DA-PBGD keeps an accuracy above 97% for rotation angles of 20°, 30° and 40°, although it drops to 53.2% at 50°. Moreover, Fig. 2 clearly shows the trade-off between the difficulty of the task and the minimization of the source risk in action: when the DA task is feasible, DA-PBGD prefers to minimize the domain disagreement even if this implies an increase of the empirical source error, but when this minimization becomes hard, i.e. the complexity of the task is high, it prefers to focus only on the empirical source error.
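A rotated two-moons problem of this kind can be generated along the following lines. The sketch is only an assumed construction: the exact sampling scheme, noise level and sample sizes used in the experiments are not given in the text, and the function names are hypothetical. The source sample is drawn from two interleaving half-circles and the target sample is obtained by rotating points around the origin.

```python
import numpy as np

def make_two_moons(n_per_class, noise, rng):
    """Two interleaving half-circles with labels +1 (upper) and -1 (lower)."""
    t = rng.uniform(0.0, np.pi, size=n_per_class)
    upper = np.column_stack([np.cos(t), np.sin(t)])
    t = rng.uniform(0.0, np.pi, size=n_per_class)
    lower = np.column_stack([1.0 - np.cos(t), 0.5 - np.sin(t)])
    X = np.vstack([upper, lower])
    X = X + rng.normal(scale=noise, size=X.shape)  # isotropic Gaussian noise
    y = np.concatenate([
        np.ones(n_per_class, dtype=int),
        -np.ones(n_per_class, dtype=int),
    ])
    return X, y

def rotate(X, angle_deg):
    """Rotate every row of X counter-clockwise around the origin."""
    theta = np.deg2rad(angle_deg)
    R = np.array([
        [np.cos(theta), -np.sin(theta)],
        [np.sin(theta), np.cos(theta)],
    ])
    return X @ R.T
```

A target sample for a given angle could then be built as `rotate(make_two_moons(n, noise, rng)[0], angle)`, with the labels discarded to mimic the unlabeled target setting.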

Among all the possible exciting perspectives, we notably aim to theoretically define elegant and relevant assumptions allowing one to control the $\lambda_\rho$ term of Eq. (3), in order to make our DA bound very tight.



Table 1: Average accuracy results (in %) for 4 rotation angles. DA-PBGD is more stable than the others and outperforms all the methods for 2 angles.

Rotation angle    20°     30°     40°     50°
PBGD              99.5    89.8    78.6    60
SVM               89.6    76      68.8    60
TSVM              100     78.9    74.6    70.9
DASVM             100     78.4    71.6    66.6
DASF              98      92      83      70
DA-PBGD           97.7    97.6    97.4    53.2

Figure 2: The trade-off between target and source errors according to the difficulty of the task (i.e. the rotation angle). [Plot of the source error and target error curves.]